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Abstract 

Background: Genomic prediction in multiple populations can be viewed as a multi-task learning problem where 
tasks are to derive prediction equations for each population and multi-task learning property can be improved by 
sharing information across populations. The goal of this study was to develop a multi-task Bayesian learning model 
for multi-population genomic prediction with a strategy to effectively share information across populations. Simulation 
studies and real data from Holstein and Ayrshire dairy breeds with phenotypes on five milk production traits were used 
to evaluate the proposed multi-task Bayesian learning model and compare with a single-task model and a simple data 
pooling method. 

Results: A multi-task Bayesian learning model was proposed for multi-population genomic prediction. Information was 
shared across populations through a common set of latent indicator variables while SNP effects were allowed to vary 
in different populations. Both simulation studies and real data analysis showed the effectiveness of the multi-task model 
in improving genomic prediction accuracy for the smaller Ayshire breed. Simulation studies suggested that the 
multi-task model was most effective when the number of QTL was small (n = 20), with an increase of accuracy by 
up to 0.09 when QTL effects were lowly correlated between two populations (p = 0.2), and up to 0.1 6 when QTL effects 
were highly correlated (p = 0.8). When QTL genotypes were included for training and validation, the improvements 
were 0.16 and 0.22, respectively, for scenarios of the low and high correlation of QTL effects between two populations. 
When the number of QTL was large (n = 200), improvement was small with a maximum of 0.02 when QTL genotypes 
were not included for genomic prediction. Reduction in accuracy was observed for the simple pooling method when 
the number of QTL was small and correlation of QTL effects between the two populations was low. For the real data, 
the multi-task model achieved an increase of accuracy between 0 and 0.07 in the Ayrshire validation set when 28,206 
SNPs were used, while the simple data pooling method resulted in a reduction of accuracy for all traits except for 
protein percentage. When 246,668 SNPs were used, the accuracy achieved from the multi-task model increased by 
0 to 0.03, while using the pooling method resulted in a reduction of accuracy by 0.01 to 0.09. In the Holstein 
population, the three methods had similar performance. 

Conclusions: Results in this study suggest that the proposed multi-task Bayesian learning model for multi-population 
genomic prediction is effective and has the potential to improve the accuracy of genomic prediction. 

Keywords: Multi-task learning, Bayesian model, Multi-population, Genomic prediction, Stochastic search variable selection 



Background 

Genomic prediction has become a new tool for selection 
of candidates based on genomic estimated breeding values 
(GEBV) through the use of dense markers covering the 
whole genome [1]. To predict GEBV, a training data set 
with genotypes and phenotypes is used to derive the 
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prediction equations, where all marker effects are estimated 
simultaneously. GEBV for selection candidates that 
have genotypes are then predicted by summing up all 
the marker effects. The accuracy of GEBV is affected by 
several factors [2,3], of which the number of individuals in 
the training data set and the marker density are of crucial 
importance [2,3]. 

In Holstein dairy cattle, genomic prediction has been 
successfully applied using the Illumina BovineSNP50 
single nucleotide polymorphism (SNP) panel [4,5]. For 
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smaller populations such as Ayrshire in dairy cattle, 
acquisition of a large number of animals to be included 
in the training data set for genomic prediction still 
remains a challenge. One strategy is to combine data of 
the small populations with data from other populations 
to increase the size of the training set. However, simply 
pooling data from different populations may result in 
unfavorable accuracies if the marker density is low or 
the populations have diverged for a long time [6-8]. 
Increasing the marker density is one of the possible 
solutions because the linkage disequilibrium (LD) phase 
persistence between markers and quantitative trait loci 
(QTL) among different populations would likely to im- 
prove. However, a recent study in Jersey and Holstein 
dairy cattle reported a very limited advantage by using a 
very high density SNP panel [9] . Few studies have been 
dedicated to research new methods or strategies other 
than simply pooling data together for genomic predic- 
tion. Brondum et al. [10] proposed an approach called 
BayesRS for multi-population genomic prediction, where 
a location specific genetic variance derived in one popula- 
tion were used as priors for another population. They 
found that for some traits, BayesRS might be advantageous 
compared to the approach of simply pooling training data 
sets for distantly-related populations; but for closely related 
populations the method did not perform better than simply 
pooling data together. 

Multi-task learning, the term first coined by R Caruana 
[11], aims to improve learning performance by simultan- 
eously learning from related tasks. In text and speech 
recognition, image reconstruction and many other areas 
where data are collected from multiple sources, multi-task 
learning has been successfully applied [12-14]. Recently, 
the multi-task learning has attracted a growing interest 
in biological science for sequence and gene expression 
analyses [15-17] as well as genome-wide association 
studies (GWAS) [18]. To our knowledge, the application 
of multi-task learning in genomic selection has not been 
reported so far. 

Bayesian learning models using stochastic search vari- 
able selection (SSVS) has been widely used and proven 
effective for genomic prediction in a single population 
[1,19-21]. Normally, SSVS uses some types of spike and 
slab distributions as priors for SNP effects. A latent in- 
dicator variable (0 or 1) is associated with each SNP, 
with 0 indicating that the SNP is irrelevant to the trait 
and is excluded from the model, and 1 indicating that 
the SNP is associated to the trait phenotype and is included 
in the model. In this study, it was assumed that multiple 
populations share the same set of latent indicator variables 
which can be learned jointly. The goal was to develop a 
multi-task Bayesian learning model for multi-population 
genomic prediction and to evaluate its performance on 
both simulated and real data. 



Methods 

In this section, single-task Bayesian learning model, the 
simple data pooling method, and the multi-task Bayesian 
learning model were introduced. A Gibbs sampling 
algorithm was designed to implement the multi-task 
Bayesian learning model. Monte Carlo simulation 
studies were conducted to evaluate the performance of 
the proposed multi-task model. Real data on five milk 
production traits from Holstein and Ayrshire dairy 
breeds were also used to test the effectiveness of the 
multi-task model. 

Single-task bayesian learning model 

In a single reference population of n animals with 
genotypes on m SNP markers, the statistical model can 
be written as: 

m 

/=1 

Where y t is the phenotypic value for the /* animal 
(i = 1,. ..,«), u is the intercept, is the genotype for 
the / SNP locus (/' = l,...,m) of the i th animal, which is 
coded 0, 1 or 2, depending on the number of copies from 
a specified allele, a, is the regression coefficient for the 
jtn ( a n e i e substitution effect), and e, is the random 
residual error. 

A flat prior distribution is assigned to u. aj is assumed 
a mixture of a normal distribution N (0, cr^) and a point 
mass density at zero (denoted by a Dirac delta function 
8 0 (a.j) hereinafter). The weights for the two distributions 
are (1-w) and w, respectively, so that (aj\w, a^j~(l-w) 
N(0,al) + wdo(aj). w follows a uniform prior distribu- 
tion. A latent indicator variable y, is introduced for each 
SNP, so that when y,=i , aj~N(Q, cr^) , and when y ; =0 , 
aj=0. Prior distribution for each ji is assumed i.i.d. and 
follows Bernoulli distribution with probability (1-w). So 

the joint prior density for y is f{y\w) = Y\w^ r ^ (i-w) r ' . 

i 

Residual errors are assumed from a multivariate normal 
distribution N(0,ldf) . The prior distribution for a\ (afj 
is a scaled inverse Chi-square distribution with degree of 
freedom vjvj and scale factor s 2 a (s 2 ) . 

Simple data pooling method 

Suppose animals are from a number of c different popula- 
tions. In a simple data pooling method, animals from 
multiple populations are pooled together to form a single 
training data set. It is assumed that the population origin 
for each individual is known prior to the analysis. Popula- 
tion origin is included as a fixed effect. The effect of each 
SNP is assumed to be the same across populations. 
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Multi-task Bayesian learning model 

For c populations with n k animals in the k-th population, 
the statistical model can be written as: 



Vik = f*k + Yl Xi > ka i k 

7=1 



- e<* 



1, 



1. 



In matrix notation, this can be written as: 

Yk = M + X k a k + e k: 

where is the phenotypic value for the j animal in 
the k* population; \i k is the general mean of population 
k; Xij k is the genotype for the / SNP locus of the i th 
animal in the k* population; ettk is the f h SNP effect in 
population k, and e ik is the random residual effect; and 
in the matrix notation, y k , a k , and e k are vectors of pheno- 
typic values, SNP effects, and residual effects, respectively; 
1 is a vector with all elements set to 1; and X k is the design 
matrix relating y k to a k . In the model, a* allows the f h 
SNP effect to have a different value in population k. To 
share information among different populations, a com- 
mon latent indicator variable indicating whether SNP / is 
associated with a QTL is used across populations. Accom- 
modating these features into a Bayesian model produces 
the multi-task Bayesian learning model. 

The following prior distributions for the unknown 
parameters and hyper-parameters are assumed in the 
multi-task Bayesian learning model: 

(Mj.-flat distribution, 

MX;, <4) ~YjN{0, o 2 ak ) + So {a jk ) , 

{ Yj \w)~w l -ri{l-wy>, 

w~ Uniform (0, 1), 



Oak/2 



J2 \-(l+v**)/2 



r(v ak /2) V 

(e k \o 2 ek )~N(0,lo- 2 ek ), 



exp 



v «k s lk 



r(v ek /2) 



2r (l+, et )/2 /V^4 

cxpj 



The likelihood function of the whole data given all the 
parameters in the model is: 



ek) eX P 



.VekiYk-Pk-Xkak) (y k -\i k -X k a k ) 



So the joint posterior density function is: 



Gibbs sampling algorithm 

A Gibbs sampling algorithm was designed to draw 
samples for unknown (hyper-) parameters from their 
full conditional posterior distributions. To avoid reducibility 
of Markov chain, y ; and a.j k are jointly sampled by first 
sampling y, from f{ys\0j-,y) followed by sampling aj k 
from f(djk\ypSj-,y), where 8j- represents all parameters 
except Yj and aj k . Full conditional posterior distributions 
for fi k ,w,o 2 k and a 2 k can be derived by picking up the 
relevant parts from the joint posterior distribution. Deriv- 
ation for the density function f(yj\0j-,y) and sampling of 
y y are described in the Appendix. The Gibbs sampling 
steps are described as below: 

Step 1. Initialize the parameters w, y, o 2 ak , o~ 2 k ,fi k and a k . 
Step 2. For j=l,-, m 

a. Sample y ; from Bernoulli distribution with 
probability 1/(1+^), 



<t)k 



k ajk / 



in which 



x jk (Yk ~ t*k ~ X lk:jk- 1 a'lfcjk- 1 ~ X jk+ lank* jk+ l:mk ) 



ajk 



x Sk x )k + ^/ ff ^ 



and 



a 



ck 



"* *ik x jk + o 2 Ja 2 ak 
(see Appendix for details). 

b. For k=l,-c, sample aj k from 

( S 0 (aj k ) ifyj = 0 

Step 3. Sample w from Beta distribution 

/My) = Bf^gy^H 1 -"')^ 1 ) where or = m - E yj and ji = I 

Step 4. For k=l,—, c, sample a 2 k from scaled inverse 
Chi-square distribution 
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Step 5. For k=l,—, c, sample a 2 ek from scaled inverse 
Chi-square distribution 

J VekS 2 ek + (y k -fi k -X k a k ) (y k -fi k -X k a k ) 



Step 6. For k=l,-, c, 

Sample H from TV ( (^)V**) ; ^ 

Repeat Step 2 to 6 until a set number of iterations are 
reached. 

It can be shown in step 2 that information from 
all populations are used to generate the latent vari- 
able Yp an d the SNP effect aj k is generated for each 
population. 

Computer program 

Computer programs were written in C language to im- 
plement the multi-task Bayesian learning model, single- 
task Bayesian learning and the simple data pooling 
method. Programs and source codes are available upon 
request. 

Monte Carlo simulations 

The aim of the simulation was to evaluate the perform- 
ance of the proposed multi-task Bayesian learning model 
and to compare with the single-task model and the simple 
data pooling method. Different scenarios were considered 
that differ in number of QTL affecting the trait, corre- 
lations of the QTL effects between different popula- 
tions, and the density of SNP panels used for genomic 
prediction. 

Real genotypes from 458 Ayrshire and 2,298 Holstein 
bulls were used for simulations. All Ayrshire animals 
and 690 Holstein animals were genotyped on the Illu- 
mina BovineHD BeadChip (800 k) SNP panel, and the 
remaining 1,608 Holstein animals were genotyped on the 
Illumina BovineSNP50 BeadChip (50 k) SNP panel. SNPs 
meeting one of the following criteria were excluded: minor 
allele frequency (MAF) lower than 0.05, missing genotype 
rate greater than 0.10, highly correlated with any other 
SNP genotype (95% genotypes from two loci identical 
or in complementary). SNP locations were determined 
against Bovine genome assembly UMD3.1 [22]. SNPs 
with unknown locations or on sex chromosomes were 
discarded. SNPs filtered in one breed were also removed 
from the other breed. After editing, 246,668 SNPs from 
the 800 k panel and 28,206 SNPs from the 50 k panel were 
kept for analyses. 

For ease of computation, only SNPs from the first 10 
chromosomes were used. Four scenarios were considered 



with combinations of QTL number being either 20 or 200 
and the correlation of QTL effects between the two 
populations being low (p = 0.2) or high (p = 0.8). For 
each scenario, QTL were randomly sampled from the 50 k 
panel. For each QTL /, allele substitution effects in the 
two populations, an and oc^, were sampled from a 
bivariate normal distribution with mean 0 and variance- 

covariance structure £ = Q i)" 2 m wmcn o 2 = 1. Breed- 
ing value of each animal i, was calculated as a; = Xjaj 

J 

where Xj is the QTL genotypes coded as 0, 1 or 2, number 
of copies on an arbitrarily chosen allele. Total additive 
genetic variance were calculated within each breed as 

a\ = y^2p/l-p, )«?, where pj is the QTL allele frequency. 

/ 

For each animal i, a residual effect e„ was sampled from a 
normal distribution with mean 0 and variance a 2 , so that 
phenotypic value of the animal ^-a^+e,. a 2 was equal to 
a 2 to give a trait with heritability of 0.5. Each scenario 
was replicated 10 times. 

Four SNP panels of different density, the original 
800 k panel, and the mimicked 400 k, 200 k, 100 k by 
selecting every 2nd, 4th, and 8th SNP, respectively, from 
the 800 k panel were used for genomic prediction. For 
each scenario described above, the simulated QTL were 
removed from the SNP panels. A special scenario was 
designed to keep all QTL genotypes in the 800 k panel 
for genomic prediction. The 50 k genotypes of 1,608 
Holstein animals were imputed to 800 k using genotype 
imputation software Flmpute developed by Sargolzaei 
et al. [23] . Imputation accuracy was evaluated on a set of 
126 animals that have been genotyped on both the 50 k 
and 800 k panel, and the ratio of the genotypes that were 
correctly imputed was 0.9930. 

For training and validation purposes, 393 Ayrshire and 
2,084 Holstein animals born before 2004 were used as 
training set, 65 Ayrshire and 214 Holstein animals born 
in and after 2004 were used for validation. The number 
of animals used for genomic prediction is shown in 
Table 1. The simulated phenotypic values for training 
animals were used to derive SNP effects using different 
models. Degree of freedom of the inverse chi-square 
distributions for variances of SNP effects and residual 
effects were set to 4 and 10, respectively. The scale 
parameter S 2 was derived from the expected value of a 



Table 1 Number of animals used for genomic prediction 





Ayrshire 


Holstein 


Training set 


393 


2084 


Validation set 


65 


214 


Total 


458 


2298 
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scaled inverse chi-square distributed random variable, 
i.e., Eia 2 -) = S 2 vl{v - 2) and hence S 2 a = o 2 a {v a -2)/v a where 
a 2 a is the true additive genetic variance. S 2 was derived 
similarly. The Gibbs chain was run for 50,000 cycles with 
the first 10,000 discarded as burn-in. Marker effects were 
estimated as averages from all post burn-in samples. Gen- 
omic breeding values for animals in the validation set were 
estimated as the sum of population mean and all marker 
effects. Accuracy was evaluated as Pearson's correlation 
coefficient between genomic estimated breeding values 
and true breeding values for validation animals, i.e. r 
(GEBV, TBV). Regression of true breeding value on 
genomic estimated breeding values, b(TBV, GEBV), 
was also calculated to evaluate the bias of the genomic 
estimated breeding values. 

Real data 

Five milk production traits including milk yield, fat yield, 
protein yield, fat percentage and protein percentage were 
used for the same set of animals in the simulation study. 
Bull proofs (estimated breeding values from progeny 
testing, EBV) from April 2008 were used as phenotypes 
for training animals, and bull proofs from December 
2011 were used for validation animals. All proofs were 
provided by Canadian Dairy Network (CDN) and had 
reliability above 0.65. 

Two SNP panels with 28,206 SNPs from the 50 k 
panel and 246,668 SNPs from the 800 k panel were used 
for genomic prediction. Degree of freedom of the inverse 
chi-square distributions for variances of SNP effects and 
residual effects were set to 4 and 10, respectively. Scale 
parameters for the two distributions were derived in the 
same way as described in the simulation study, but instead 
of using true additive genetic variance and residual 
variance, estimated variances were used. Estimated addi- 
tive genetic variances and residual variance were obtained 
from ASReml [24] by fitting an animal model with a 
population mean, animal effect, and random residual 
effect. The pedigree contained 15,731 animals for the 
Holstein population and 4,926 animals for Ayrshire. 
For 28,206 SNPs, the Gibbs sampling procedure was 
run for 50,000 iterations with the first 10,000 discarded 
as burn-in; and for 246,668 SNPs, the Gibbs sampling 
procedure was run for 100,000 iterations with the first 
50,000 discarded as burn-in. Burn-in period was deter- 
mined by visually inspecting the Gibbs chain. All sam- 
ples after the burn-in period were kept. SNP effects 
were estimated by averaging all samples after the burn- 
in period. After the estimation of SNP effects, the GEBV 
was calculated for animals in the validation set by sum- 
ming up the population mean and all the SNP effects. 
Accuracy was measured as Pearson's correlation coefficient 
between GEBV and the 2011 bull proofs for validation 
animals. 



Results 

Monte Carlo simulation study 

Table 2 shows the accuracy of genomic prediction with 
simulated 20 QTL. In the Ayrshire validation set, multi- 
task Bayesian learning model performed the best among 
the three methods within each SNP panel used under 
the scenario with either a low (p = 0.2) or high (p = 0.8) 
correlation of simulated QTL effects between Ayrshire 
and Holstein populations. The greatest increase of accur- 
acy was observed when p was 0.8 and when density of the 
SNP panel was the highest (accuracy increased from 0.67 
for the single-task method to 0.83 for the multi-task 
method). Simply pooling data together substantially re- 
duced the prediction accuracy in Ayshire when p was 
0.2. The greatest reduction of accuracy was observed 
when p was 0.2 and when density of the SNP panel was 
the highest (accuracy decreased from 0.71 for the 
single-task method to 0.56 for the pooling method). 
When p was 0.8, the pooling method had an increased 
accuracy in Ayrshire compared with the single-task 
method. The pooling method produced substantially 
lower accuracy than the multi-task model in the Ayrshire 
validation set, especially when QTL effects were lower 
correlated between the two populations. In the Holstein 
validation set, the multi-task model performed similar or 
slightly better than the single-task model. The pooling 
method had a slightly reduction of accuracy when p was 
0.2, and had a similar accuracy compared with the single- 
task method when p was 0.8. Within each method, 
increasing density of the SNP panel generally improved 
prediction accuracy in both the Ayrshire and Holstein 
populations. Table 2 also shows the slopes of regression of 
true breeding values on the GEBV. Regression coefficients 
of true breeding values on GEBV were less than one 
for the pooling method indicating that the GEBV were 
inflated. 

Table 3 shows the accuracy of genomic prediction 
with simulated 200 QTL. In the Ayrshire validation 
set, the multi-task model had slightly higher predic- 
tion accuracy within each SNP panel compared with 
the single-task model under either scenario with low 
(p =0.2) or high (p =0.8) correlated QTL effects be- 
tween the Ayrshire and Holstein populations. Pooling 
method performed similar or slightly worse compared 
with single-task model when p was 0.2, and generally 
performed better when p was 0.8. The pooling method 
also performed slightly better than the multi-task 
model when p was 0.8 and when density of the SNP 
panel was relatively high. When p was 0.2, regression 
coefficients of true breeding values on GEBV were less 
than one for the pooling method indicating that the 
GEBV were inflated. The three methods performed 
similar in the Holstein validation set. Overall, the ac- 
curacy was lower compared with scenarios when only 
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Table 2 Accuracy {expressed as correlations between true breeding values (TBV) and genomic estimated breeding 
values [GEBV; r(TBV, GEBV)]}, and slopes [b(TBV, GEBV)] of regression of TBV on GEBV for genomic prediction with 20 
simulated QTL 

Ayrshire Holstein 
SNP panel Single-task Pooling Multi-task Single-task Pooling Multi-task 

rfTBV, GEBV) 
p = 0.2 



800 k 


0.71 ±0.02 


0.56 


±0.04 


0.75 ± 0.02 


0.91 ±0.01 


0.90 ±0.01 


0.91 ±0.01 


400 k 


0.64 ± 0.04 


0.53 


±0.04 


0.73 ±0.03 


0.90 ±0.01 


0.86 ±0.01 


0.90 ±0.01 


200 k 


0.60 ± 0.05 


0.50 


±0.04 


0.68 ± 0.02 


0.88 ±0.01 


0.84 ±0.01 


0.88 ±0.01 


100 k 


0.57 ±0.04 


0.47 


±0.04 


0.63 ± 0.03 


0.84 ±0.01 


0.81 ±0.02 


0.84 ±0.01 


p = 0.8 
















800 k 


0.67 ± 0.05 


0.76 


±0.02 


0.83 ±0.01 


0.92 ±0.01 


0.92 ±0.01 


0.93 ± 0.01 


400 k 


0.66 ± 0.05 


0.72 


±0.02 


0.80 ±0.01 


0.90 ±0.01 


0.89 ±0.01 


0.90 ±0.01 


200 k 


0.66 ± 0.04 


0.68 


±0.02 


0.76 ±0.01 


0.86 ±0.01 


0.86 ±0.01 


0.86 ±0.01 


100 k 


0.61 ±0.04 


0.63 


±0.04 


0.72 ±0.02 


0.83 ±0.01 


0.83 ± 0.01 


0.84 ±0.01 


b(TBV, GEBV) 
















p = 0.2 
















800 k 


1 .06 ± 0.05 


0.73 


±0.05 


1 .04 ± 0.05 


0.98 ± 0.02 


1.01 ±0.01 


0.98 ±0.01 


400 k 


1 .06 ± 0.05 


0.76 


±0.06 


1 .06 ± 0.05 


0.99 ± 0.02 


1.00 ±0.01 


0.98 ±0.01 


200 k 


1 .02 ± 0.06 


0.75 


±0.05 


1 .03 ± 0.04 


1.00 ±0.01 


0.99 ±0.01 


0.98 ±0.01 


100 k 


1 .00 ± 0.06 


0.75 


±0.05 


1 .03 ± 0.06 


1.00 ±0.01 


1 .00 ± 0.02 


0.99 ±0.01 


p = 0.8 
















800 k 


1.10 ±0.06 


0.90 


±0.03 


1.10 ±0.04 


0.99 ± 0.02 


1 .00 ± 0.02 


0.99 ± 0.02 


400 k 


1 .09 ± 0.06 


0.89 


±0.04 


1 .07 ± 0.04 


0.99 ± 0.02 


0.99 ±0.01 


0.99 ± 0.02 


200 k 


1.1 4 ±0.06 


0.89 


±0.06 


1 .04 ± 0.05 


0.98 ± 0.02 


1 .00 ± 0.02 


0.98 ± 0.03 


100 k 


1 .08 ± 0.08 


0.85 


±0.06 


1 .06 ± 0.05 


0.98 ± 0.03 


0.99 ± 0.03 


0.98 ± 0.03 



20 QTL were simulated regardless of the methods 
used. 

To evaluate the performance of the three methods 
under situations where all QTL genotypes can be ac- 
quired and included for genomic prediction, Table 4 
shows the accuracy of genomic prediction when QTL 
genotypes are included together with SNP marker 
genotypes for training and validation. When 20 QTL 
were simulated, using multi-task model improved the 
accuracy by 0.16 and 0.22 in the Ayrshire validation set 
for scenarios where correlation of QTL effects between 
the Ayrshire and Holstein populations was 0.2 and 0.8, 
respectively. Using the pooling method reduced the accur- 
acy in Ayrshire by 0.12 when p was 0.2, but increased the 
accuracy by 0.17 when p was 0.8. When 200 QTL were 
simulated, using the multi-task model increased the pre- 
diction accuracy by 0.03 and 0.07, respectively, for p equal 
to 0.2 and 0.8. The pooling method reduced the accuracy 
by 0.04 when p was 0.2, and increased the accuracy by 
0.11 when p was 0.8. The pooling method outperformed 
the multi-task model only for the scenario where 200 



QTL were simulated with their effects highly correlated 
between the two populations. In the Holstein validation 
set, the multi-task model had similar or slightly higher 
accuracy compared with the single-task model. The 
pooling method had similar accuracy as the single-task 
model when p was 0.8. When p was 0.2, the pooling 
method had slightly lower accuracy compared with the 
single-task model. Regression coefficients of true breed- 
ing values on GEBV were less than one for the pooling 
method except for the scenario of 200 simulated QTL 
with effects highly correlated between the two popula- 
tions, indicating that the GEBV predicted by the pooling 
method were inflated. 

Real data analysis 

Table 5 shows the accuracy of genomic prediction for 
milk production traits using real data. For the Ayrshire 
validation set, the multi-task model increased the accuracy 
by up to 0.07 compared with the single-task model, while 
the simple data pooling method resulted in reduced accur- 
acy in general. The greatest increase in accuracy using 
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Table 3 Accuracy {expressed as correlations between true breeding values (TBV) and genomic estimated breeding 
values [GEBV; r(TBV, GEBV)]}, and slopes [b(TBV, GEBV)] of regression of TBV on GEBV for genomic prediction with 200 
simulated QTL 



SNP panel 


Ayrshire 






Holstein 






Single-task 


Pooling 


Multi-task 


Single-task 


Pooling 


Multi-task 


rfTBV, GEBV) 














p = 0.2 














800 k 


046 ± 0.02 


0.44 ± 0.04 


0.47 ± 0.03 


0.77 ±0.01 


0.76 ±0.01 


0.77 ±0.01 


400 k 


0.46 ± 0.02 


0.43 ± 0.03 


0.47 ± 0.02 


0.76 ±0.01 


0.76 ±0.01 


0.76 ±0.01 


200 k 


0.46 ± 0.02 


0.42 ± 0.04 


0.47 ± 0.02 


0.75 ±0.01 


0.75 ± 0.01 


0.75 ± 0.01 


100 k 


0.45 ± 0.02 


0.41 ± 0.03 


0.46 ± 0.03 


0.74 ±0.01 


0.74 ±0.01 


0.74 ±0.01 


p = 0.8 
800 k 


0.54 ± 0.04 


0.57 ± 0.02 


0.56 ± 0.03 


0.74 ± 0.02 


0.75 ± 0.01 


0.75 ± 0.02 


400 k 


0.54 ± 0.03 


0.56 ± 0.03 


0.55 ± 0.03 


0.74 ± 0.02 


0.74 ± 0.02 


0.74 ± 0.02 


200 k 


0.54 ± 0.03 


0.56 ± 0.02 


0.55 ± 0.03 


0.73 ± 0.02 


0.73 ± 0.02 


0.73 ± 0.02 


100 k 


0.53 ± 0.03 


0.52 ± 0.02 


0.54 ±0.03 


0.72 ± 0.02 


0.72 ± 0.02 


0.72 ± 0.02 


b(TBV, GEBV) 














p = 0.2 














800 k 


1.1 4 ±0.08 


0.89 ± 0.09 


1.1 6 ±0.09 


1.07 ±0.01 


1.09 ±0.01 


1.07 ±0.01 


400 k 


1.1 4 ±0.09 


0.91 ±0.08 


1.1 3 ±0.09 


1.07 ±0.01 


1.09 ±0.01 


1.07 ±0.01 


200 k 


1.14 + 0.09 


0.88 ± 0.08 


1.1 4 ±0.09 


1.08 ±0.01 


1.09 ±0.01 


1.08 ±0.01 


100 k 


1.1 5 ±0.09 


0.89 ± 0.09 


1.1 5 ±0.09 


1 .09 ± 0.02 


1.11 ±0.03 


1 .09 ± 0.02 


p = 0.8 

800 k 


1 .1 2 ± 0.1 1 


1 .00 ± 0.06 


1 .07 ± 0.09 


1.01 ±0.02 


1 .00 ± 0.02 


1.01 ±0.02 


400 k 


1.11 ±0.11 


1 .02 ± 0.07 


1 .08 ± 0.09 


1.01 ±0.02 


1 .00 ± 0.02 


1.01 ±0.02 


200 k 


1 .1 2 ± 0.1 1 


1.04 ±0.07 


1.11 ±0.09 


1.01 ±0.02 


1.01 ±0.02 


1.01 ±0.02 


100 k 


1.11 ±0.11 


0.99 ±0.07 


1.10 + 0.10 


1.01 ±0.02 


1.01 ±0.02 


1.01 ±0.02 



Table 4 Accuracy {expressed as correlations between true breeding values (TBV) and genomic estimated breeding 
values [GEBV; r(TBV, GEBV)]}, and slopes [b(TBV, GEBV)] of regression of TBV on GEBV for genomic prediction with 
simulated QTL genotypes included for training and validation 







Ayrshire 






Holstein 






No. of QTL 


P 


Single-task 


Pooling 


Multi-task 


Single-task 


Pooling 


Multi-task 


rfTBV, GEBV) 
















20 


0.2 


0.76 ± 0.04 


0.64 ± 0.04 


0.92 ±0.01 


0.96 ±0.01 


0.93 ±0.01 


0.97 ±0.01 


20 


0.8 


0.71 ±0.05 


0.88 ± 0.02 


0.93 ±0.01 


0.97 ±0.01 


0.97 ±0.01 


0.97 ±0.01 


200 


0.2 


0.57 ± 0.03 


0.53 ± 0.03 


0.60 ± 0.04 


0.77 ±0.01 


0.76 ±0.01 


0.78 ±0.01 


200 


0.8 


0.54 ± 0.04 


0.65 ± 0.02 


0.61 ±0.04 


0.76 ± 0.02 


0.78 ± 0.02 


0.77 ±0.01 


b(TBV, GEBV) 
















20 


0.2 


1.01 ±0.06 


0.83 ± 0.08 


0.99 ± 0.03 


1.00 ±0.01 


1 .00 ± 0.02 


1.00 ±0.01 


20 


0.8 


1 .04 ± 0.08 


0.89 ± 0.06 


0.97 ± 0.03 


0.98 ±0.01 


1.01 ±0.02 


0.99 ±0.01 


200 


0.2 


1.23 ±0.10 


0.89 ± 0.06 


1.1 6 ±0.08 


1 .00 ± 0.03 


1 .03 ± 0.03 


1.01 ±0.03 


200 


0.8 


1.10 ±0.09 


1 .06 ± 0.06 


1.1 2 ±0.09 


0.98 ± 0.03 


0.97 ± 0.03 


0.98 ± 0.02 
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Table 5 Accuracy of genomic prediction of breeding values for milk production traits 


Trait 


Ayrshire 






Holstein 






Single-task 


Pooling 


Multi-task 


Single-task 


Pooling 


Multi-task 


No. of SNP: 28,206 














Milk yield 


0.52 


0.44 


0.54 


0.66 


0.65 


0.66 


Fat yield 


0.64 


0.55 


0.66 


0.63 


0.64 


0.63 


Protein yield 


0.70 


0.60 


0.70 


0.69 


0.69 


0.68 


Fat % 


0.66 


0.58 


0.72 


0.74 


0.74 


0.74 


Protein % 


0.48 


0.51 


0.55 


0.67 


0.68 


0.67 


No. of SNP: 246,668 














Milk yield 


0.54 


0.53 


0.55 


0.64 


0.64 


0.64 


Fat yield 


0.67 


0.62 


0.67 


0.63 


0.64 


0.63 


Protein yield 


0.72 


0.68 


0.72 


0.66 


0.66 


0.66 


Fat % 


0.66 


0.65 


0.69 


0.77 


0.77 


0.78 


Protein % 


0.51 


0.42 


0.53 


0.71 


0.68 


0.70 



multi-task model compared with single-task model was 
for fat percentage (0.06) and protein percentage (0.07) 
when 28,206 SNPs were used. For the Holstein validation 
set, the single-task model, simple data pooling method, 
and the multi-task model performed similar regardless of 
the traits studied. 

Discussion 

Traditionally, genomic prediction with data from multiple 
populations were implemented either by running genomic 
prediction within each population (single-task) or by sim- 
ply pooling data together. Single-task genomic prediction 
cannot utilize information from other populations and 
therefore, the accuracy of genomic prediction is largely 
determined by the size of training data set [2,3]. For 
breeds with only a small number of animals having both 
DNA marker genotypes and phenotype data, the accur- 
acy of genomic prediction can be low [25]. Combining 
the data with other breed populations has the potential to 
improve the prediction accuracy. It is however, difficult 
to effectively account for the differences of SNP effects 
among different populations by simply pooling data 
together. If the marker density is low or the populations 
are divergent from each other, simply pooling data to- 
gether may result in unfavorable prediction accuracies 
[6]. The multi-task Bayesian learning model proposed in 
this study uses information from all populations simultan- 
eously while allowing the SNP effects to vary in different 
populations. Different populations share information 
through a common set of latent indicator variables. When 
the target trait has a similar genetic background in related 
populations, it is reasonable to assume that some shared 
QTL affecting a common trait in different populations. 



However, the linkage disequilibrium phase between SNP 
markers and QTL are likely to be inconsistent, especially 
when the marker density is low. Therefore, the multi-task 
Bayesian learning model is more flexible about the SNP 
effects and is likely to have better performance than a 
simple data pooling method. 

Results from simulation studies support the use of 
multi-task Bayesian model for multi-population genomic 
prediction especially when there are a few QTL affecting 
the trait. For the scenario where a few QTL affect the 
trait, the increase of accuracy by using multi-task model 
was greater when QTL effects had a higher correlation 
between two populations. The accuracy was further in- 
creased when QTL genotypes were included for training 
and validation. These results are expected as the higher 
the correlation of QTL effects between two populations, 
the more informative information the two populations 
can share. Including QTL genotypes also increased the 
amount of information to be shared between the two 
populations. Results suggest that the proposed multi-task 
Bayesian learning model is effective in combining infor- 
mation from multi-populations to improve accuracy of 
genomic prediction. 

Results from simulation also showed that simply pooling 
data together may reduce the accuracy of genomic predic- 
tion if QTL effects were lowly correlated between two 
populations or if a relatively low density SNP panel was 
used. When QTL effects had a high correlation between 
two populations and the SNP panel was high, the pooling 
method increased the accuracy compared with single-task 
model. When the correlation of QTL effects between two 
populations was high, the pooling method was inferior to 
the multi-task model if the number of QTL was small, but 
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outperformed the multi-task model if the number of QTL 
was large. These results are reasonable since the pooling 
method assumes the same SNP effects among different 
populations. This assumption will be violated if the 
correlation of QTL effects is low between two popula- 
tions and may result in poor performance of the pool- 
ing method. The multi-task model proposes to share 
information through the same latent indicator vari- 
ables. Correlations of QTL effects among different 
populations are not explicitly used and therefore, in- 
formation across populations may not be fully utilized 
if such correlations are high. This might be why the 
pooling method still outperformed the multi-task model 
when the number of QTL is large and the correlation of 
QTL effects between two populations was high. 

Results from real data analyses in this study showed 
that the multi-task Bayesian learning model produced a 
similar or higher accuracy compared to the single-task 
model, and that simply pooling data together resulted in 
a reduced accuracy when the marker density was low. 
SNP effects could be different across populations, espe- 
cially when lower density markers were used [6]. These 
results are in agreement with results from simulation 
studies. 

Gains of accuracy by using the multi-task model were 
higher for fat percentage and protein percentage traits 
than for other traits. This is likely due to that large QTL 
or genes such as DGAT1, have larger influence on the 
percentage traits than on the yield traits [26,27]. These 
results agreed with simulation studies which showed that 
the multi-task model performed better when there are 
fewer QTL affecting the trait. 

In this study, only two dairy cattle breeds were consid- 
ered, and the multi-task Bayesian learning model was 
shown to be effective and more beneficial to the population 
with a smaller data size. For the larger population, using 
multi-task model did not produce much improvement. The 
little improvement in the larger population could be due 
to that the smaller population is too small to be able to 
have a significant impact on the larger population. In 
practice, there are situations where many populations are 
to be combined for analysis with each one contributing a 
small amount of data, a typical example as in beef cattle 
production. It would be interesting to test the performance 
of the multi-task model under such scenarios. 

The current multi-task model considered additive gen- 
etic effects only. Non-additive genetic effects, which have 
gained growing interests with availability of genomics 
information [28-30], can also be accommodated into the 
multi-task model. A similar strategy used in this study 
to sharing information across populations for additive 
genetic effects may also be applied to non-additive genetic 
effects. Inclusion of non-additive genetic effects in the 
multi-task model warrants further investigation. 



The proposed multi-task Bayesian learning model used 
a spike and slab mixture distribution to conduct variable 
selection and shrinkage for SNP effects. This prior set- 
ting is similar to that in the BayesCn method proposed 
by Habier et al. [21]. Other types of mixture distribu- 
tions currently being used for genomic prediction 
[1,19,20], can also be adapted to the multi-task model. 
The strategy used in the multi-task Bayesian learning 
model allows different populations to share informa- 
tion through a common latent variable assumed for 
each SNP. Such strategy has been shown effective in 
both simulation and real data studies; however, other 
strategies may also be exploited, for example, by mod- 
eling the joint distribution of the SNP effects among 
different populations. Further investigations are required 
to evaluate alternative strategies for sharing information 
across populations. 

In this study, the Bayesian models were implemented via 
a Gibbs sampling algorithm adopting the residual-update 
computing strategy proposed by Legarra and Misztal [31]. 
Computing time for this algorithm is proportional to the 
number of animals and number of SNPs in the model. For 
the training sample size of 2,477 animals and genotypes of 
246,668 SNPs, the proposed multi-task model requires 44 
CPU hours to complete 150,000 Gibbs sampling cycles on 
a Linux cluster system with Intel X5675 3.07GHz CPU. 
With increased training sample size and increasing marker 
density to a very high density panel or even sequence data, 
computing burdens would become a concern. A recent 
study [32] has improved the residual-update algorithm 
resulting in the CPU time reduced by 35.3 to 43.3%. 
The authors in the same study [32] also proposed an alter- 
native algorithm which reduced CPU time by 74.5 to 
93.0%. Approximation algorithms based on expectation- 
maximization (EM) [33-35] and variational Bayes [36] 
have also been proposed to replace the time consuming 
Markov chain Monte Carlo (MCMC) sampling based 
algorithms. Adapting these algorithms to accommodate 
the multi-task Bayesian learning model would be of 
great interest. 

Conclusions 

A multi-task Bayesian learning model was proposed for 
multi-population genomic prediction. The multi-task 
model shares information across populations through 
a common set of latent indicator variables while allow- 
ing the SNP effects to vary in different populations. 
Simulation studies and real data analysis suggest that 
the proposed multi-task Bayesian learning model is 
effective and beneficial to populations where a small 
number of training animals are available. Accuracy of 
genomic prediction in small populations can be improved 
by using the multi-task model especially for traits affected 
by a few QTL with large effects. 
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Appendix 

Sampling of r, in the multi-task Bayesian learning 
model 

With the assumption of independence between 0,~ and 
Tj, by Bayes' theorem, one has 



f(ri = m,y) 



/(yky = i,e r )/(>) = i) 

7(yl0 = i= e r)/(o = i) +/(ylo = o,e r )f{r, = o) 
i 

" i + 1, i ' 

(A.l) 



where 



= /(ylr / = 0,e r )/(r ; = 0) 
/(y|r /= l,e r )/(r 7 = 1) 

= /(o = o) 1 £ r /(ykk/ = o,e,- k ) 
/(o = i) M/(y k k; = i,e rk )- 



Similarly, 
/(>>• = 0|8 r ,y) 



(A.2) 



l + q, 



Next, the conditional likelihoods /(yJr/ = O,0j-k) and 
/(yk| r ; = 1> Oj-fc) will be derived. 

Suppose one is at the {l+l) th Gibbs sampling iteration, 
and wants to sample yi and a^, the linear regression 
model given all parameters except yy and fl,* can be 
written as: 

Y*k = Xjk<ijk + e*, 

Where yj = y*-^-Xij^ia^_ 1 -It> + i 31 ^ +1 . jrt . 

Given the priors that (fl^ly^-y^O, cr^)+ (l-y^r5 0 

(ay*) and (e k |o"^.)~AT(0, I<r^) , the conditional likelihood of 
y k can be written as: 



/(y k |r ; = 0,8 rk )= {2na 2 ek y" k/2 exp 



Yk Yk 

2<4 



(A.3) 



Then, V r * and V -, 1 are derived. 



Oak • Oak , T 
Xjk % hi 



By applying Sylvester's determinant theorem, one has 



V v 



'aft 
7 2 



(A.5) 



It can be easily verify that V"^ 1 is: 

XjkXjk \ 



Substituting A.5 and A.6 into A.4, one has 
/( y |r ; = l,e r ) = (2^)- H * /2 fx jk x jk J+l 



(A.6) 



'ek 



-1/2 



exp< 



Denote 



Yk Yk" (xjkYk) V (x, k Xjk + ° 2 J a lk 



2< 



(A.7) 



X /k (yk~ X lk:ik-l ajfcjk-1 - X )t+l:ml<» 



jk+l:mky » 2 



one gets 



, and substitute (A.7) and (A.3) into (A.2), 



(A.8) 

Finally, y,- can be drawn from a Bernoulli distribution 
with probability l/fl+q,-). 
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And 

/(y k |r; = l,0 rk ) = (2^)-^ 2 |v y .p 1/ %x P ^- yk ^^ 

(A.4) 

Where V y > = Xj k Xj k cr^ + Ia 2 ek is the (co-) variance 
matrix of y k given that r ; =l. 
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